Overview

= Finite-state automata (FSA): What for?
— Recap: Chomsky hierarchy of grammars and languages
— FSA, regular languages and regular expressions
— Appropriate problem classes and applications
= Finite-state automata and algorithms
— Regular expressions and FSA
— Deterministic (DFSA) vs. non-deterministic (NFSA) finite-state automata
— Determinization: from NFSA to DFSA
— Minimization of DFSA
= Extensions: finite-state transducers and FST operations

Finite-state automata: What for? 


Chomsky hierarchy of grammars and languages, and the corresponding hierarchy of automata

Languages                              Grammars                   Automata
Regular languages (Type-3)             Regular PS grammar         Finite-state automata
Context-free languages (Type-2)        Context-free PS grammar    Push-down automata
Context-sensitive languages (Type-1)   Tree adjoining grammars    Linear bounded automata
Type-0 languages                       General PS grammars        Turing machine

From Type-3 down to Type-0: computationally more complex, less efficient


Finite-state automata model regular languages

[Figure: regular expressions describe/specify regular languages; finite-state automata recognize/generate regular languages]

= properties of regular languages
= appropriate problem classes
= algorithms for FSA

Languages, formal languages and grammars 


= Alphabet Σ: finite set of symbols
= String: sequence x_1 ... x_n of symbols x_i from the alphabet Σ
— Special case: empty string ε
= Language over Σ: the set of strings that can be generated from Σ
— Sigma star Σ*: set of all possible strings over the alphabet Σ
  Σ = {a,b}, Σ* = {ε, a, b, aa, ab, ba, bb, aaa, aab, ...}
— Sigma plus Σ+: Σ+ = Σ* − {ε}
— Special languages: ∅ = {} (empty language) ≠ {ε} (language of the empty string)
= A formal language: a subset of Σ*
= Basic operation on strings: concatenation ·
— If a = x_1 ... x_i and b = x_{i+1} ... x_n, then a · b = ab = x_1 ... x_n
— Concatenation is associative but not commutative
— ε is the identity element: aε = εa = a
= A grammar of a particular type generates a language of a corresponding type
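
The definition of Σ* above can be made concrete with a small sketch. The generator name `sigma_star` is a hypothetical helper introduced only for illustration; it enumerates strings over an alphabet by increasing length, with ε represented as the empty Python string.

```python
from itertools import count, islice, product


def sigma_star(alphabet):
    """Yield all strings over `alphabet`, shortest first (hypothetical helper)."""
    for length in count(0):
        # product(..., repeat=0) yields one empty tuple, i.e. the empty string ε
        for symbols in product(sorted(alphabet), repeat=length):
            yield "".join(symbols)


# Σ* is infinite, so only a finite prefix of the enumeration is taken.
first = list(islice(sigma_star({"a", "b"}), 9))
print(first)
```

Dropping the first element of the enumeration gives Σ+, and any subset of the generated strings is a formal language over Σ.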


Recap on Formal Grammars and Languages

= A formal grammar is a tuple G = <Σ, Φ, S, R>
— Σ alphabet of terminal symbols
— Φ alphabet of non-terminal symbols (Σ ∩ Φ = ∅)
— S the start symbol
— R finite set of rules R ⊆ Γ* × Γ* of the form α → β,
  where Γ = Σ ∪ Φ and α ≠ ε and α ∉ Σ*
= The language L(G) generated by a grammar G
— set of strings w ∈ Σ* that can be derived from S according to G = <Σ, Φ, S, R>
= Derivation: given G = <Σ, Φ, S, R> and w, v ∈ Γ* = (Σ ∪ Φ)*
— a direct derivation (1 step) w ⇒_G v holds iff
  u_1, u_2 ∈ Γ* exist such that w = u_1 α u_2 and v = u_1 β u_2, and α → β ∈ R exists
— a derivation w ⇒*_G v holds iff either w = v
  or z ∈ Γ* exists such that w ⇒*_G z and z ⇒_G v
= A language generated by a grammar G: L(G) = {w : S ⇒*_G w & w ∈ Σ*}
  I.e., L(G) strongly depends on R!


Chomsky Hierarchy of Grammars

= Classification of languages generated by formal grammars
— A language is of type i (i = 0,1,2,3) iff it is generated by a type-i grammar
— Classification according to increasingly restricted types of production rules
  L-type-0 ⊃ L-type-1 ⊃ L-type-2 ⊃ L-type-3
— Every grammar generates a unique language, but a language can be generated by several different grammars.
— Two grammars are
  * (weakly) equivalent if they generate the same string language
  * strongly equivalent if they generate both the same string language and the same tree language


Chomsky Hierarchy of Grammars 


Type-0 languages: general phrase structure grammars

= No restrictions on the form of production rules: arbitrary strings on LHS and RHS of rules
= A grammar G = <Σ, Φ, S, R> generates a language L-type-0 iff
— all rules R are of the form α → β, where α ∈ Γ+ and β ∈ Γ* (with Γ = Σ ∪ Φ)
— I.e., LHS a nonempty sequence of NT or T symbols with at least one NT symbol and RHS a possibly empty sequence of NT or T symbols
= Example:
  G = <{S,A,B,C,D,E}, {a}, S, R>, L(G) = {a^(2^n) | n ≥ 1}
  S → ACaB.   CB → E.    aE → Ea.
  Ca → aaC.   aD → Da.   AE → ε.
  CB → DB.    AD → AC.

  a^(2^2) = aaaa ∈ L(G) iff S ⇒* aaaa
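
A derivation check for the Type-0 example can be sketched as a breadth-first search over sentential forms. This is only an illustration under stated assumptions: the rule list is the example grammar as read from the slide, the function name `derives` is hypothetical, and the length bound `len(target) + 3` relies on the observation that every sentential form of this particular grammar carries at most three non-terminals (A, one of C/D/E, and B). It is not a general decision procedure for Type-0 grammars.

```python
from collections import deque

# Rules of the example grammar, written as (LHS, RHS); "" stands for ε.
RULES = [
    ("S", "ACaB"), ("Ca", "aaC"), ("CB", "DB"), ("CB", "E"),
    ("aD", "Da"), ("AD", "AC"), ("aE", "Ea"), ("AE", ""),
]


def derives(target, start="S"):
    """Return True iff start =>* target, searching forms up to a length bound."""
    max_len = len(target) + 3  # assumption specific to this grammar
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in RULES:
            i = form.find(lhs)
            while i != -1:  # apply the rule at every position where LHS occurs
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False


print(derives("aaaa"), derives("aaa"))
```

Each step of the search is one direct derivation: a substring matching some LHS is replaced by the corresponding RHS.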


Chomsky Hierarchy of Grammars 


Type-1 languages: context-sensitive grammars

= A grammar G = <Σ, Φ, S, R> generates a language L-type-1 iff
— all rules R are of the form αAγ → αβγ, or S → ε (with no S symbol on RHS),
  where A ∈ Φ and α, β, γ ∈ Γ* (Γ = Σ ∪ Φ), β ≠ ε
— I.e., LHS: non-empty sequence of NT or T symbols with at least one NT symbol,
  and RHS a nonempty sequence of NT or T symbols (exception: S → ε)
— For all rules LHS → RHS: |LHS| ≤ |RHS|
= Example:
  L = {a^n b^n c^n | n ≥ 1}
  R = { S → aSBC,  aB → ab,
        S → aBC,   bB → bb,
        CB → BC,   bC → bc,  cC → cc }
  a^3 b^3 c^3 = aaabbbccc ∈ L(G) iff S ⇒* aaabbbccc


Chomsky Hierarchy of Grammars 


Type-2 languages: context-free grammars

= A grammar G = <Σ, Φ, S, R> generates a language L-type-2 iff
— all rules R are of the form A → α,
  where A ∈ Φ and α ∈ Γ* (Γ = Σ ∪ Φ)
— I.e., LHS: a single NT symbol; RHS a (possibly empty) sequence of NT or T symbols
= Example:
  L = {a^n b a^n | n ≥ 0}
  R = {S → ASA, S → b, A → a}


Chomsky Hierarchy of Grammars 


Type-3 languages: regular or finite-state grammar

= A grammar G = <Σ, Φ, S, R> is called right (left) linear (or regular) iff
— all rules R are of the form
  A → w or A → wB (or A → Bw), where A, B ∈ Φ and w ∈ Σ*
— i.e., LHS: a single NT symbol; RHS: a (possibly empty) sequence of T symbols, optionally followed (preceded) by a NT symbol
= Example:
  Σ = {a,b}
  Φ = {S, A, B}
  R = { S → aA, A → aA, A → bbB, B → bB, B → b }
  S ⇒ aA ⇒ aaA ⇒ aabbB ⇒ aabbbB ⇒ aabbbb
  [Figure: derivation tree for aabbbb]


Operations on languages 


= Typical set-theoretic operations on languages
— Union: L1 ∪ L2 = { w : w ∈ L1 or w ∈ L2 }
— Intersection: L1 ∩ L2 = { w : w ∈ L1 and w ∈ L2 }
— Difference: L1 − L2 = { w : w ∈ L1 and w ∉ L2 }
— Complement of L ⊆ Σ* wrt. Σ*: Σ* − L
= Language-theoretic operations on languages
— Concatenation: L1 L2 = { w1 w2 : w1 ∈ L1 and w2 ∈ L2 }
— Iteration: L^0 = {ε}, L^1 = L, L^2 = LL, ..., L* = ∪_{i≥0} L^i, L+ = ∪_{i>0} L^i
— Mirror image: L^-1 = { w^-1 : w ∈ L }
= Union, concatenation and Kleene star are called regular operations
= Regular sets/languages: languages that are defined by the regular operations: concatenation (·), union (∪) and Kleene star (*)
= Regular languages are closed under concatenation, union, Kleene star, intersection and complementation

Regular languages, regular expressions and FSA

Regular languages and regular expressions

= Regular sets/languages can be specified/defined by regular expressions
= Given a set of terminal symbols Σ, the following are regular expressions
— ε is a regular expression
— For every a ∈ Σ, a is a regular expression
— If R is a regular expression, then R* is a regular expression
— If Q, R are regular expressions, then QR (Q · R) and Q ∪ R are regular expressions
= Every regular expression denotes a regular language
— L(ε) = {ε}
— L(a) = {a} for all a ∈ Σ
— L(αβ) = L(α)L(β)
— L(α ∪ β) = L(α) ∪ L(β)
— L(α*) = L(α)*


Finite-state automata (FSA)

= Grammars: generate (or recognize) languages
  Automata: recognize (or generate) languages
= Finite-state automata recognize regular languages
= A finite automaton (FA) is a tuple A = <Φ, Σ, δ, q0, F>
— Φ a finite non-empty set of states
— Σ a finite alphabet of input letters
— δ a transition function Φ × Σ → Φ
— q0 ∈ Φ the initial state
— F ⊆ Φ the set of final (accepting) states
= Transition graphs (diagrams):
— states: circles, p ∈ Φ
— transitions: directed arcs between circles, δ(p, a) = q
— initial state: p = q0
— final state: p ∈ F


FSA transition graphs and traversal

= Transition graph (state diagram)
  [Figure: transition graph of an example FSA, with its set of final states F and its transition function δ: Φ × Σ → Φ listed entry by entry]
= Traversal of an FSA
— Computation with an FSA
= FSA can be used for
  * acceptance (recognition)
  * generation

FSA traversal and acceptance of an input string

= Traversal of a (deterministic) FSA
— FSA traversal is defined by states and transitions of A, relative to an input string w ∈ Σ*
— A configuration of A is defined by the current state and the unread part of the input string: (q, w_i), with q ∈ Φ, w_i suffix of w
— A transition: a binary relation between configurations
  (q, w_i) |-_A (q', w_{i+1}) iff w_i = z w_{i+1} for z ∈ Σ and δ(q, z) = q'
  (q, w_i) yields (q', w_{i+1}) in a single transition step
— Reflexive, transitive closure of |-_A: (q, w_i) |-*_A (q', w_j)
  (q, w_i) yields (q', w_j) in zero or a finite number of steps
= Acceptance
— Decide whether an input string w is in the language L(A) defined by FSA A
— An FSA A accepts a string w iff (q0, w) |-*_A (q_f, ε), with q0 initial state, q_f ∈ F
— The language L(A) accepted by FSA A is the set of all strings accepted by A
  I.e., w ∈ L(A) iff there is some q_f ∈ F such that (q0, w) |-*_A (q_f, ε)
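
The notions of configuration and acceptance can be sketched in a few lines of Python. The automaton below is a hypothetical DFSA for the language a+b+, chosen only for illustration; the names `run`, `accepts`, `DELTA` and `FINALS` are assumptions, not part of the slides. The transition function is a dictionary from (state, symbol) to state, and a configuration is a pair (state, unread input).

```python
# Hypothetical DFSA for a+b+ : q0 -a-> q1 -a-> q1 -b-> q2 -b-> q2, F = {q2}
DELTA = {("q0", "a"): "q1", ("q1", "a"): "q1",
         ("q1", "b"): "q2", ("q2", "b"): "q2"}
FINALS = {"q2"}


def run(delta, q0, word):
    """Return the sequence of configurations (state, unread input)."""
    configs, state = [(q0, word)], q0
    for i, symbol in enumerate(word):
        if (state, symbol) not in delta:
            break  # no transition defined: the traversal gets stuck
        state = delta[(state, symbol)]
        configs.append((state, word[i + 1:]))
    return configs


def accepts(delta, q0, finals, word):
    """w is accepted iff (q0, w) yields (q_f, ε) with q_f final."""
    state, unread = run(delta, q0, word)[-1]
    return unread == "" and state in finals


print(run(DELTA, "q0", "aab"))
print(accepts(DELTA, "q0", FINALS, "aab"))
```

Each input symbol is inspected once, so the number of steps grows linearly with the length of the input.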


Regular grammars and Finite-state automata 


= A grammar G = <Σ, Φ, S, R> is called right linear (or regular) iff all rules R are of the form A → w or A → wB, where A, B ∈ Φ and w ∈ Σ*
— Σ = {a,b}, Φ = {S,A,B}, R = {S → aA, A → aA, A → bbB, B → bB, B → b}
  S ⇒ aA ⇒ aaA ⇒ aabbB ⇒ aabbbB ⇒ aabbbb
— The NT symbol corresponds to a state in an FSA: the future of the derivation only depends on the identity of this state or symbol and the remaining production rules.
— Correspondence of type-3 grammar rules with transitions in a (non-deterministic) FSA:
  * A → wB ≡ δ(A, w) = B
  * A → w ≡ δ(A, w) = q, q ∈ F
— Conversely, we can construct an FSA from the rules of a type-3 grammar
  [Figure: derivation tree for aabbbb and the corresponding FSA]
= Regular grammars and FSA can be shown to be equivalent
= Regular grammars generate regular languages
= Regular languages are defined by concatenation, union, Kleene star
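
The correspondence between right-linear rules and FSA transitions can be sketched as follows. The rule encoding (triples with `None` for a missing right-hand non-terminal), the state name `q_f`, and the function name `accepts` are assumptions made for this illustration. Non-terminals act as states, a rule A → wB becomes a transition from A to B reading w, and a rule A → w leads to a single added final state.

```python
# Example grammar from the slide: (LHS, terminal string w, RHS non-terminal or None)
RULES = [("S", "a", "A"), ("A", "a", "A"), ("A", "bb", "B"),
         ("B", "b", "B"), ("B", "b", None)]
FINAL = "q_f"  # hypothetical extra final state for rules of the form A -> w


def accepts(word, rules=RULES, start="S"):
    """Simulate the (non-deterministic) FSA induced by the right-linear rules."""
    todo, seen = [(start, 0)], set()  # configurations: (state, symbols read)
    while todo:
        state, pos = todo.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if state == FINAL:
            if pos == len(word):
                return True
            continue
        for lhs, w, rhs in rules:
            if lhs == state and word.startswith(w, pos):
                todo.append((rhs if rhs is not None else FINAL, pos + len(w)))
    return False


print(accepts("aabbbb"), accepts("abb"))
```

The string aabbbb from the derivation above is accepted, while abb is rejected because rule A → bbB must be followed by at least one further b.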


Deterministic finite-state automata 


= Deterministic finite-state automata (DFSA)
— at each state, there is at most one transition that can be taken to read the next input symbol
— the next state (transition) is fully determined by the current configuration
— δ is functional (and there are no ε-transitions)
= Determinism is a useful property for an FSA to have!
— Acceptance or rejection of an input can be computed in linear time O(n) for inputs of length n
— Especially important for processing of LARGE documents
= Appropriate problem classes for FSA
— Recognition and acceptance of regular languages, in particular string manipulation, regular phonological and morphological processes
— Approximations of non-regular languages in morphology, shallow finite-state parsing, ...


Multiple equivalent FSA 


= FSA for the language L_lehr = { lehrbar, lehrbarkeit, belehrbar, belehrbarkeit, unbelehrbar, unbelehrbarkeit, unlehrbar, unlehrbarkeit }
= DFSA for L_lehr
  [Figure: deterministic FSA for L_lehr]
= Regular expression and FSA for L_lehr: (un | ε) (be lehr | lehr) bar (keit | ε)
  [Figure: non-deterministic FSA for L_lehr]
= Equivalent FSA
  [Figure: equivalent non-deterministic FSA for L_lehr]
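
As a quick check that the regular expression for L_lehr denotes exactly the eight listed words, the sketch below expands the expression by hand and also tests the words with Python's `re` module. Note the assumptions: `re` uses a backtracking matcher rather than a compiled FSA, so this only illustrates the language, not the automaton; the names `WORDS`, `PATTERN` and `generated` are introduced for this example.

```python
import re

WORDS = ["lehrbar", "lehrbarkeit", "belehrbar", "belehrbarkeit",
         "unbelehrbar", "unbelehrbarkeit", "unlehrbar", "unlehrbarkeit"]

# (un | ε) (be lehr | lehr) bar (keit | ε); an empty alternative plays the role of ε
PATTERN = re.compile(r"(un|)(belehr|lehr)bar(keit|)")

# Expanding the expression by hand: concatenation of the three finite choices.
generated = {prefix + stem + "bar" + suffix
             for prefix in ("un", "")
             for stem in ("belehr", "lehr")
             for suffix in ("keit", "")}

print(sorted(generated))
print(all(PATTERN.fullmatch(w) for w in WORDS))
```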


Defining FSA through regular expressions 


= FSA for even mildly complex regular languages are best constructed from regular expressions!
= Every regular expression denotes a regular language
— L(ε) = {ε}
— L(a) = {a} for all a ∈ Σ
— L(αβ) = L(α)L(β)
— L(α ∪ β) = L(α) ∪ L(β)
— L(α*) = L(α)*
= Every regular expression translates to an FSA (closure properties)
— An FSA for a (with L(a) = {a}), a ∈ Σ: [Figure]
— An FSA for ε (with L(ε) = {ε}), ε ∉ Σ: [Figure]
— Concatenation of two FSA F_A and F_B:
  * S_AB = S_A (S: initial state)
  * F_AB = F_B (F: set of final states)
  * δ_AB = δ_A ∪ δ_B ∪ {δ(<q_i, ε>, q_j) | q_i ∈ F_A, q_j = S_B}

Defining FSA through regular expressions

— Union of two FSA F_A and F_B:
  * S_AB = s_0 (new state)
  * F_AB = {s_j} (new state)
  * δ_AB = δ_A ∪ δ_B
    ∪ {δ(<q_0, ε>, q_z) | q_0 = S_AB, (q_z = S_A or q_z = S_B)}
    ∪ {δ(<q_z, ε>, q_j) | (q_z ∈ F_A or q_z ∈ F_B), q_j ∈ F_AB}
— Kleene star over an FSA F_A:
  * S_A* = s_0 (new state)
  * F_A* = {q_j} (new state)
  * δ_A* = δ_A
    ∪ {δ(<q_j, ε>, q_z) | q_j ∈ F_A, q_z = S_A}
    ∪ {δ(<q_0, ε>, q_z) | q_0 = S_A*, (q_z = S_A or q_z ∈ F_A*)}
    ∪ {δ(<q_z, ε>, q_j) | q_z ∈ F_A, q_j ∈ F_A*}


Defining FSA through regular expressions 


(ab ∪ aba)*

= ε-transition: move to δ(q, ε) without reading an input symbol
= FSA construction from regular expressions yields a non-deterministic FSA (NFSA)
— Choice of next state is only partially determined by the current configuration, i.e., we cannot always predict which state will be the next state in the traversal
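
The construction of an NFSA from a regular expression can be sketched with three combinators for concatenation, union and Kleene star, each adding ε-arcs in the way the construction slides describe. This is a minimal sketch under stated assumptions: every automaton has exactly one final state, ε is encoded as the label `None`, and all names (`NFA`, `symbol`, `concat`, `union`, `star`, `literal`, `closure`, `accepts`) are hypothetical.

```python
from itertools import count

_ids = count()  # fresh state numbers


class NFA:
    def __init__(self, start, final, arcs):
        self.start, self.final, self.arcs = start, final, arcs  # arcs: {(p, label, q)}


def symbol(a):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, {(s, a, f)})


def concat(x, y):  # ε-arc from the final state of x to the start of y
    return NFA(x.start, y.final, x.arcs | y.arcs | {(x.final, None, y.start)})


def union(x, y):  # new start and final state, linked by ε-arcs
    s, f = next(_ids), next(_ids)
    eps = {(s, None, x.start), (s, None, y.start),
           (x.final, None, f), (y.final, None, f)}
    return NFA(s, f, x.arcs | y.arcs | eps)


def star(x):  # loop back, skip, and exit via ε-arcs
    s, f = next(_ids), next(_ids)
    eps = {(x.final, None, x.start), (s, None, x.start),
           (s, None, f), (x.final, None, f)}
    return NFA(s, f, x.arcs | eps)


def literal(w):
    nfa = symbol(w[0])
    for a in w[1:]:
        nfa = concat(nfa, symbol(a))
    return nfa


def closure(nfa, states):
    """All states reachable from `states` through ε-arcs only."""
    seen, todo = set(states), list(states)
    while todo:
        p = todo.pop()
        for src, label, dst in nfa.arcs:
            if src == p and label is None and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen


def accepts(nfa, word):
    current = closure(nfa, {nfa.start})
    for a in word:
        moved = {dst for src, label, dst in nfa.arcs if src in current and label == a}
        current = closure(nfa, moved)
    return nfa.final in current


AB_OR_ABA_STAR = star(union(literal("ab"), literal("aba")))
print(accepts(AB_OR_ABA_STAR, "ababa"), accepts(AB_OR_ABA_STAR, "abaa"))
```

The simulation tracks the set of all states the NFSA could be in, which sidesteps the need to guess the next state.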


Non-deterministic finite-state automata (NFSA) 


= Non-determinism
  * Introduced by ε-transitions and/or
  * Transition being a relation Δ over Φ × Σ* × Φ, i.e. a set of triples <q, z, q'>
    Equivalently: transition function δ maps to a set of states: δ: Φ × Σ → ℘(Φ)
= A non-deterministic FSA (NFSA) is a tuple A = <Φ, Σ, δ, q0, F>
  * Φ a finite non-empty set of states
  * Σ a finite alphabet of input letters
  * δ a transition function Φ × Σ* → ℘(Φ) (or a finite relation over Φ × Σ* × Φ)
  * q0 ∈ Φ the initial state
  * F ⊆ Φ the set of final (accepting) states
= Adapted definitions for transitions and acceptance of a string by an NFSA
  * (q, w_i) |-_A (q', w_{i+1}) iff w_i = z w_{i+1} for z ∈ Σ* and q' ∈ δ(q, z)
  * An NFSA (w/o ε) accepts a string w iff there is some traversal such that (q0, w) |-*_A (q', ε) and q' ∈ F.
  * A string w is rejected by NFSA A iff A does not accept w, i.e. all configurations of A for string w are rejecting configurations!


Non-determinism in FSA

[Figures: NFSA for (ab ∪ aba)*]

Equivalence of DFSA and NFSA

= Despite non-determinism, NFSA are not more powerful than DFSA: they accept the same class of languages: regular languages
= For every non-deterministic FSA there is a deterministic FSA that accepts the same language (and vice versa)
— The corresponding DFSA has in general more states, in which it models the sets of possible states the NFSA could be in in a given traversal
= There is an algorithm (via subset construction) that allows conversion of an NFSA to an equivalent DFSA
= Efficiency considerations: an FSA is most efficient and compact iff
— it is a DFSA (efficiency) → Determinization of NFSA
— it is minimal (compact encoding) → Minimization of FSA

Equivalence of DFSA and NFSA 


= FSA A1 and A2 are equivalent iff L(A1) = L(A2)
= Theorem: for each NFSA there is an equivalent DFSA
= Construction: A = <Φ, Σ, δ, q0, F> an NFSA over Σ
— define eps(q) = {p ∈ Φ | (q, ε, p) ∈ δ}
— define an FSA A' = <Φ', Σ, δ', q0', F'> over sets of states, with
  Φ' = {B | B ⊆ Φ}
  q0' = eps(q0)
  δ'(B, a) = ∪ {eps(p) | q ∈ B and ∃ p ∈ Φ such that (q, a, p) ∈ δ}
  F' = {B ⊆ Φ | B ∩ F ≠ ∅}
= A' satisfies the definition of a DFSA. We need to show that L(A) = L(A')
= Define D(q, w) := {p ∈ Φ | (q, w) |-*_A (p, ε)} and
  D'(Q, w) := {P ∈ Φ' | (Q, w) |-*_A' (P, ε)}


Equivalence of DFSA and NFSA: Proof 


Prove: D(q0, w) = D'({q0}, w) by induction over length of w
= |w| = 0: by definition of D and D'
= Induction step: |w| = k+1, w = v a; by hypothesis:
  D(q0, v) = D'({q0}, v) = {p_1, p_2, ..., p_m} = P
  by def. of D: D(q0, w) = ∪_{p ∈ P} {eps(q) | (p, a, q) ∈ δ}
  by def. of δ': δ'({p_1, p_2, ..., p_m}, a) = ∪_{p ∈ P} {eps(q) | (p, a, q) ∈ δ}
  It follows:
  D'({q0}, w) = δ'(D'({q0}, v), a) = δ'({p_1, p_2, ..., p_m}, a)
  = ∪_{p ∈ P} {eps(q) | (p, a, q) ∈ δ} = D(q0, w) q.e.d.
= Finally, A and A' only accept if D'({q0}, w) = D(q0, w) contain a state ∈ F


Determinization by subset construction

NFSA A = <Φ, Σ, δ, q0, F>, L(A) = a(ba)* ∪ a(bba)*
DFSA A' = <Φ', Σ, δ', q0', F'>

Subset construction: compute δ' from δ for all subsets S ⊆ Φ and a ∈ Σ s.th.
δ'(S, a) = { s' | ∃ s ∈ S s.th. (s, a, s') ∈ δ }

Φ' = {B | B ⊆ {1,2,3,4,5,6}}
q0' = {1}

δ'({1},a) = {2,3},     δ'({4},a) = {2},
δ'({1},b) = ∅,         δ'({4},b) = ∅,
δ'({2,3},a) = ∅,       δ'({3},a) = ∅,
δ'({2,3},b) = {4,5},   δ'({3},b) = {5},
δ'({4,5},a) = {2},     δ'({5},a) = ∅,
δ'({4,5},b) = {6},     δ'({5},b) = {6},
δ'({2},a) = ∅,         δ'({6},a) = {3},
δ'({2},b) = {4},       δ'({6},b) = ∅

F' = {{2,3}, {2}, {3}}

[Figures: NFSA A and the resulting DFSA A']

L(A) = L(A') = a(ba)* ∪ a(bba)*

ε-transitions and ε-closure

= Subset construction must account for ε-transitions
= ε-closure
— The ε-closure of some state q consists of q as well as all states that can be reached from q through a sequence of ε-transitions
  * q ∈ ε-closure(q)
  * If r ∈ ε-closure(q) and (r, ε, q') ∈ δ, then q' ∈ ε-closure(q)
— ε-closure defined on sets of states
  * ε-closure(R) = ∪_{q ∈ R} ε-closure(q) (with R ⊆ Φ)
= Subset construction for ε-NFSA
— Compute δ' from δ for all subsets S ⊆ Φ and a ∈ Σ s.th.
  δ'(S, a) = { s'' | ∃ s ∈ S s.th. (s, a, s') ∈ δ and s'' ∈ ε-closure(s') }


Example 
= ε-NFSA for (a|b)c*
  [Figure: ε-NFSA with states 0 to 9]

ε-closure for all s ∈ Φ:
ε-closure(0) = {0,1,2},
ε-closure(1) = {1},
ε-closure(2) = {2},
ε-closure(3) = {3,5,6,7,9},
ε-closure(4) = {4,5,6,7,9},
ε-closure(5) = {5,6,7,9},
ε-closure(6) = {6,7,9},
ε-closure(7) = {7},
ε-closure(8) = {8,7,9},
ε-closure(9) = {9}

Transition function over subsets
δ'({0}, ε) = {0,1,2},
δ'({0,1,2}, a) = {3,5,6,7,9},
δ'({0,1,2}, b) = {4,5,6,7,9},
δ'({3,5,6,7,9}, c) = {8,7,9},
δ'({4,5,6,7,9}, c) = {8,7,9},
δ'({8,7,9}, c) = {8,7,9}
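
The closures and subset transitions listed above can be recomputed with a short sketch. The arc tables `EPS` and `ARCS` are an assumption: the figure of the ε-NFSA is not available here, so the arcs were inferred from the listed ε-closures and subset transitions. The function names `eps_closure` and `step` are likewise hypothetical.

```python
# Assumed ε-arcs and labelled arcs of the ε-NFSA for (a|b)c*, inferred from the closures
EPS = {0: {1, 2}, 3: {5}, 4: {5}, 5: {6}, 6: {7, 9}, 8: {7, 9}}
ARCS = {(1, "a"): {3}, (2, "b"): {4}, (7, "c"): {8}}


def eps_closure(q):
    """q itself plus everything reachable through a sequence of ε-arcs."""
    seen, todo = {q}, [q]
    while todo:
        r = todo.pop()
        for t in EPS.get(r, ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


def step(subset, a):
    """δ'(S, a): follow a-arcs from any s in S, then close under ε."""
    result = set()
    for s in subset:
        for target in ARCS.get((s, a), ()):
            result |= eps_closure(target)
    return result


start = eps_closure(0)
print(start, step(start, "a"), step(step(start, "a"), "c"))
```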


An algorithm for subset construction

= Construction of DFSA A' = <Φ', Σ, δ', q0', F'> from NFSA A = <Φ, Σ, δ, q0, F>
— Φ' = {B | B ⊆ Φ}, if unconstrained, can have 2^|Φ| elements;
  with |Φ| = 33 this could lead to an FSA with 2^33 states
  (exceeds the range of 32-bit integers)
— Many of these states may be useless

  L = (a|b)a* ∪ ab+a*
  [Figure: full subset automaton for L. No transition can ever enter some of the subset states; only states that can be traversed starting from q0' need to be considered.]

= Basic idea: we only need to consider states B ⊆ Φ that can ever be traversed by a string w ∈ Σ*, starting from q0'
— I.e. those B ⊆ Φ for which B = δ'(q0', w), for some w ∈ Σ*, with δ' the recursively constructed transition function for the target DFSA A'
= Consider all strings w ∈ Σ* in order of their length: ε, a, b, aa, ab, ba, bb, aaa, ...
— Construction by increasing lengths of strings
— For each a ∈ Σ, construct transitions to known or new states according to δ
— New target states are placed in a queue (FIFO)
— Termination: no states left on queue


An algorithm for subset construction 


DETERMINIZE(Φ, Σ, δ, q0, F)
  q0' ← {q0}
  Φ' ← {q0'}
  ENQUEUE(Queue, q0')
  while Queue ≠ ∅
    S ← DEQUEUE(Queue)
    for a ∈ Σ
      δ'(S,a) ← ∪_{r ∈ S} δ(r,a)
      if δ'(S,a) ∉ Φ'
        Φ' ← Φ' ∪ {δ'(S,a)}
        ENQUEUE(Queue, δ'(S,a))
        if δ'(S,a) ∩ F ≠ ∅
          F' ← F' ∪ {δ'(S,a)}
        fi
      fi
  return (Φ', Σ, δ', q0', F')

Complexity
= Maximal number of states placed in queue is 2^|Φ|, so worst-case runtime is exponential
  * determinization is a costly operation,
  * but results in an efficient FSA (linear in size of the input)
  * avoids computation of isolated states
= Actual run time depends on the shape of the NFSA
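
The queue-based subset construction can be sketched in Python as follows. This is an illustration, not the exact procedure of the slide: it differs from the pseudocode in two small, deliberate ways, namely it does not create a state for the empty subset and it also marks the start subset as final when q0 is final. The NFSA arcs are an assumption reconstructed from the a(ba)* ∪ a(bba)* example, and the names `determinize` and `NFSA_DELTA` are hypothetical.

```python
from collections import deque


def determinize(alphabet, delta, q0, finals):
    """Subset construction restricted to subsets reachable from {q0}."""
    start = frozenset({q0})
    states, dfa_delta, dfa_finals = {start}, {}, set()
    if q0 in finals:
        dfa_finals.add(start)
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for a in sorted(alphabet):
            target = frozenset(t for s in subset for t in delta.get((s, a), ()))
            if not target:
                continue  # no a-transition from this subset (no dead state is built)
            dfa_delta[(subset, a)] = target
            if target not in states:
                states.add(target)
                queue.append(target)
                if target & finals:
                    dfa_finals.add(target)
    return states, dfa_delta, start, dfa_finals


# Assumed NFSA for a(ba)* ∪ a(bba)*, with initial state 1 and F = {2, 3}
NFSA_DELTA = {(1, "a"): {2, 3}, (2, "b"): {4}, (4, "a"): {2},
              (3, "b"): {5}, (5, "b"): {6}, (6, "a"): {3}}

states, dfa_delta, start, dfa_finals = determinize({"a", "b"}, NFSA_DELTA, 1, {2, 3})
print(len(states), sorted(sorted(s) for s in dfa_finals))
```

Only 8 of the 64 possible subsets are ever constructed for this example, which is the point of restricting the construction to reachable subsets.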


Minimization of FSA

= Can we transform a large automaton into a smaller one (provided a smaller one exists)?
= If A is a DFSA, is there an algorithm for constructing an equivalent minimal automaton A' from A?
  [Figure: DFSA A and smaller DFSA A']
— A is equivalent to A', i.e., L(A) = L(A')
— A' is smaller than A, i.e., |Φ| > |Φ'|
= A can be transformed to A':
— States 2 and 3 in A "do the same job": once A is in state 2 or 3, it accepts the same suffix string. Such states are called equivalent.
— Thus, we can eliminate state 3 without changing the language of A, by redirecting all arcs leading to 3 to 2, instead.


Minimization of FSA

= A DFSA can be minimized if there are pairs of states q, q' ∈ Φ that are equivalent
= Two states q, q' are equivalent iff they accept the same right language.
= Right language of a state:
— For A = <Φ, Σ, δ, q0, F> a DFSA, the right language L→(q) of a state q ∈ Φ is the set of all strings accepted by A starting in state q:
  L→(q) = {w ∈ Σ* | δ*(q, w) ∈ F}
— Note: L→(q0) = L(A)
= State equivalence:
— For A = <Φ, Σ, δ, q0, F> a DFSA, if q, q' ∈ Φ, q and q' are equivalent (q ≡ q') iff L→(q) = L→(q')
— ≡ is an equivalence relation (i.e., reflexive, transitive and symmetric)
— ≡ partitions the set of states Φ into a number of disjoint sets Q_1 ... Q_m of equivalence classes s.th. ∪_{i=1..m} Q_i = Φ and q ≡ q' for all q, q' ∈ Q_i


Partitioning a state set into equivalence classes

[Figure: state set partitioned into equivalence classes]

= All classes C_i consist of equivalent states q_i1 ... q_in that accept identical right languages L→(q_ij)
= Whenever two states q, q' belong to different classes, L→(q) ≠ L→(q')
= Equivalence classes on state set defined by ≡
= Minimization: elimination of equivalent states


Minimization of a DFSA 


= A DFSA A = <Φ, Σ, δ, q0, F> that contains equivalent states q, q' can be transformed to a smaller, equivalent DFSA A' = <Φ', Σ, δ', q0, F'> where
  * Φ' = Φ \ {q'}, F' = F \ {q'},
  * δ' is like δ with all transitions to q' redirected to q:
    δ'(s,a) = q if δ(s,a) = q';
    δ'(s,a) = δ(s,a) otherwise
= Two-step algorithm
— Determine all pairs of equivalent states q, q'
— Apply DFSA reduction until no such pair q, q' is left in the automaton
= Minimality
— The resulting FSA is the smallest DFSA (in size of Φ) that accepts L(A): we never merge different equivalence classes, so we obtain one state per class.
  * We cannot do any further reduction and still recognize L(A).
  * As long as we have >1 state per class, we can do further reduction steps.
= A DFSA A = <Φ, Σ, δ, q0, F> is minimal iff there is no pair of distinct but equivalent states ∈ Φ, i.e. ∀ q, q' ∈ Φ: q ≡ q' ⇔ q = q'


Example 




Algorithm 


MINIMIZE(Φ, Σ, δ, q0, F)
  EqClass[] ← PARTITION(A)
  q0 ← min(EqClass[q0])
  for <q, a, q'> ∈ δ
    δ(q,a) ← min(EqClass[q'])
  for q ∈ Φ
    if q ≠ min(EqClass[q])
      Φ ← Φ \ {q}
      if q ∈ F
        F ← F \ {q}

MINIMIZE
  * PARTITION(A):
    - determines all eq. classes of states in A
    - returns array EqClass[q] of eq. classes of q
  * redirect all transitions <q,a,q'> ∈ δ to point to min(EqClass[q'])
  * remove all redundant states from Φ and F


Computing partitions: Naive partitioning 


NAIVE_PARTITION(Φ, Σ, δ, q0, F)
  for each q ∈ Φ
    EqClass[q] ← {q}
  for each q ∈ Φ
    for each q' ∈ Φ
      if EqClass[q] ≠ EqClass[q'] ∧ CHECKEQUIVALENCE(A_q, A_q') = True
        EqClass[q] ← EqClass[q] ∪ EqClass[q']
        EqClass[q'] ← EqClass[q]

NAIVE PARTITION
  * array EqClass of pointers to disjoint sets for equivalence classes
  * first loop: initializing EqClass by {q}, for each q ∈ Φ
  * second nested loop: if we find new equivalent states q ≡ q', we merge the respective equivalence classes and reset EqClass[q'] to point to the new merged class
  * Runtime complexity: loops: O(|Φ|^2), CheckEquivalence: O(|Φ|^2 · |Σ|), i.e. O(|Φ|^4 · |Σ|) overall!


Computing partitions: Dynamic Programming 


= Source of inefficiency: naive algorithm traverses the whole automaton to determine, for pairs q, q', whether they are equivalent
= Results of previous equivalence checks can be reused
— If q ≢ q', i.e. L→(q) ≠ L→(q'), then for all <p, p'> s.th. δ(p,a) = q and δ(p',a) = q' for some a ∈ Σ, p ≢ p'.
= Thus, non-equivalence results can be propagated
  * Propagation from final/non-final pairs: L→(q) ≠ L→(q') if q ∈ F ∧ q' ∉ F
  * Propagation from pairs <q,q'> where δ(q,a) is defined but δ(q',a) is not.


Propagation of non-equivalent states

LocalEquivalenceCheck(q, q')
  if (q ∈ F and q' ∉ F) or (q ∉ F and q' ∈ F)
    return (False)
  if ∃ a ∈ Σ s.th. only one of δ(q,a), δ(q',a) is defined
    return (False)
  return (True)

Non-equivalence check for states <q,q'>
— Only one of q, q' is final
— For some a ∈ Σ, δ(q,a) is defined, δ(q',a) is not

Propagation (I): Table filling algorithm (Aho, Sethi, Ullman)

= represent equivalence relation as a table Equiv, cells filled with boolean values
= initialize all cells with 1; reset to 0 for non-equivalent states
= main loop: call of PROPAGATE for non-equivalent states from LocalEquivalenceCheck

TableFillingPARTITION(Φ, Σ, δ, q0, F)
  for q, q' ∈ Φ, q < q'
    Equiv[q,q'] ← 1
  for q ∈ Φ
    for q' ∈ Φ, q < q'
      if Equiv[q,q'] = 1 and LocalEquivalenceCheck(q,q') = False
        Equiv[q,q'] ← 0
        PROPAGATE(q,q')

PROPAGATE(q, q')
  for a ∈ Σ
    for p ∈ δ^-1(q,a)
      for p' ∈ δ^-1(q',a)
        if Equiv[min(p,p'), max(p,p')] = 1
          Equiv[min(p,p'), max(p,p')] ← 0
          PROPAGATE(p,p')

= Runtime complexity: O(|Φ|^2 · |Σ|)
— PROPAGATE is never called twice on a given pair of states (checks Equiv[q,q'] = 1)
= Space requirements: O(|Φ|^2) cells
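
A table-filling partition can be sketched as below. Note what is swapped in: instead of the recursive PROPAGATE over inverse transitions, this sketch repeats a forward pass over all pairs until nothing changes, which marks the same pairs but is less efficient. Undefined transitions are modelled by an explicit non-final sink state, so "defined vs. undefined" distinguishes states just as in LocalEquivalenceCheck. The example DFSA and all names (`SINK`, `equivalent_pairs`, `DELTA`) are hypothetical.

```python
from itertools import combinations

SINK = "sink"  # hypothetical non-final state standing in for undefined transitions


def equivalent_pairs(states, alphabet, delta, finals):
    """Return all pairs of equivalent states, found by marking distinguishable pairs."""
    everything = list(states) + [SINK]

    def step(q, a):
        return delta.get((q, a), SINK)

    # Initial marking: exactly one state of the pair is final.
    distinct = {frozenset(pair) for pair in combinations(everything, 2)
                if (pair[0] in finals) != (pair[1] in finals)}
    changed = True
    while changed:  # repeat until no new pair gets marked
        changed = False
        for p, q in combinations(everything, 2):
            if frozenset((p, q)) in distinct:
                continue
            for a in alphabet:
                successors = frozenset((step(p, a), step(q, a)))
                if len(successors) == 2 and successors in distinct:
                    distinct.add(frozenset((p, q)))
                    changed = True
                    break
    return [(p, q) for p, q in combinations(states, 2)
            if frozenset((p, q)) not in distinct]


# Hypothetical DFSA: 1 -a-> 2, 1 -b-> 3, 2 -a-> 4, 3 -a-> 4, F = {4}
DELTA = {(1, "a"): 2, (1, "b"): 3, (2, "a"): 4, (3, "a"): 4}
print(equivalent_pairs([1, 2, 3, 4], ["a", "b"], DELTA, {4}))
```

In this example states 2 and 3 "do the same job", so one of them can be removed by redirecting its incoming arcs to the other.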


More optimizations

= Hopcroft and Ullman: space requirement O(|Φ|), by associating states with their equivalence classes
= Hopcroft: runtime complexity of O(|Φ| · log|Φ| · |Σ|), by distinction of active/non-active blocks


Brzozowski's Algorithm 


Minimization by reversal and determinization

[Figure: A is reversed, determinized, reversed again and determinized again]

Reversal
  * Final states of A^-1: set of initial states of A
  * Initial state of A^-1: F of A
  * δ^-1(q,a) = {p ∈ Φ | δ(p,a) = q}
  * L(A^-1) = L(A)^-1

Why does it yield a minimal DFSA A'?

Consider the right languages of states q, q' in NFSA (A^-1):
  * If for all distinct states q, q' the right languages are disjoint, i.e. L→(q) ∩ L→(q') = ∅, it holds that each pair of states q, q' recognizes different right languages, and thus that the NFSA (A^-1) satisfies the minimality condition for a DFSA.
  * If there were states q, q' in NFSA (A^-1) s.th. L→(q) ∩ L→(q') ≠ ∅, there would be some string w that leads to two distinct states in DFSA A. This contradicts the determinism of a DFSA.
  * Determinization of NFSA (A^-1) does not destroy the property of minimality
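
Minimization by reversal and determinization can be sketched as two small functions applied twice. This is a minimal sketch under stated assumptions: automata are given as sets of arcs (p, a, q) with sets of start and final states, the determinization builds only reachable, non-empty subsets, and the names `reverse`, `determinize` and `brzozowski` as well as the example automaton are hypothetical.

```python
def reverse(arcs, starts, finals):
    """Flip every arc and swap the roles of start and final states."""
    return {(q, a, p) for (p, a, q) in arcs}, set(finals), set(starts)


def determinize(arcs, starts, finals):
    """Subset construction over reachable subsets; the start subset is `starts`."""
    start = frozenset(starts)
    labels = {a for (_, a, _) in arcs}
    seen, todo, dfa_arcs = {start}, [start], set()
    while todo:
        subset = todo.pop()
        for a in labels:
            target = frozenset(q for (p, label, q) in arcs
                               if p in subset and label == a)
            if target:
                dfa_arcs.add((subset, a, target))
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    dfa_finals = {s for s in seen if s & set(finals)}
    return dfa_arcs, {start}, dfa_finals, seen


def brzozowski(arcs, starts, finals):
    """Reverse, determinize, reverse, determinize."""
    arcs, starts, finals = reverse(arcs, starts, finals)
    arcs, starts, finals, _ = determinize(arcs, starts, finals)
    arcs, starts, finals = reverse(arcs, starts, finals)
    return determinize(arcs, starts, finals)


# Hypothetical DFSA with equivalent states 2 and 3
ARCS = {(1, "a", 2), (1, "b", 3), (2, "a", 4), (3, "a", 4)}
min_arcs, min_starts, min_finals, min_states = brzozowski(ARCS, {1}, {4})
print(len(min_states), len(min_arcs))
```

For the example, the four-state automaton shrinks to three states because states 2 and 3 end up in the same subset after the first reversal and determinization.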


Applications of FSA: String Matching

= Exact, full string matching
— Lexicon lookup: search for given word/string in a lexicon
— Compile lexicon entries to FSA by union
— Test input words for acceptance in lexicon-FSA

[Figure: the lexicon is compiled to an FSA (transition table); recognition/application/lookup of an input word w in the FSA A_lex succeeds iff (q0, w) |-*_A (q_f, ε) is true, with q0 initial state and q_f ∈ F; traversal and recognition (acceptance)]


Applications of FSA: String Matching

= Substring matching
— Identify stop words in stream of text
— Stem recognition: small, smaller, smallest
= Make use of full power of finite-state operations!
— Regular expression with any-symbols for text search
  * ?* small(ε | er | est) ?*
  * ?* (... | ... | ...) ?*
— Compilation to NFSA, convert to DFSA
— Application by composition of FST with full text
  * FSA ∘ FST: if defined, search term is substring of text

[Figure: text stream]


Application of FSA: Replacement 


= (Sub)string replacement 
— Delete stop words in text 
— Stemming: reduce/replace inflected forms to stems: smallest → small
— Morphology: map inflected forms to lemmas (and PoS-tags):
  good, better, best → good+Adj
— Tokenization: insert token boundaries 


=> Finite-state transducers (FST) 


From Automata to Transducers 


Automata
= recognition of an input string w
= define a language
= accept strings, with transitions defined for symbols ∈ Σ

Transducers
= recognition of an input string w
= generation of an output string w'
= define a relation between languages
= equivalent to FSA that accept pairs of strings, with transitions defined for pairs of symbols <x,y>
= operations: replacement
  * deletion <a, ε>, a ∈ Σ − {ε}
  * insertion <ε, a>, a ∈ Σ − {ε}
  * substitution <a, b>, a, b ∈ Σ, a ≠ b
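
A deterministic transducer can be sketched as an FSA whose arcs carry an output next to the input symbol. The two transducers below are hypothetical examples: `UPPER` uses substitution arcs x:X for every lower case letter, and `STEM` copies the stem "small" and deletes the suffixes "er" and "est" with arcs whose output is ε (the empty string). The function name `transduce` and the table layout are assumptions made for this sketch; insertion arcs <ε, a> are not covered.

```python
import string

# Substitution: a single state 0 with arcs a:A, b:B, c:C, ...
UPPER = {(0, c): (c.upper(), 0) for c in string.ascii_lowercase}

# Copy "small" (identity arcs), then delete "er" or "est" (output ε = "")
STEM = {(i, c): (c, i + 1) for i, c in enumerate("small")}
STEM.update({(5, "e"): ("", 6), (6, "r"): ("", 8),
             (6, "s"): ("", 7), (7, "t"): ("", 8)})
STEM_FINALS = {5, 8}


def transduce(arcs, start, finals, word):
    """Return the output string for `word`, or None if `word` is not accepted."""
    state, output = start, []
    for symbol in word:
        if (state, symbol) not in arcs:
            return None
        out, state = arcs[(state, symbol)]
        output.append(out)
    return "".join(output) if state in finals else None


print(transduce(UPPER, 0, {0}, "xyzzy"))
print(transduce(STEM, 0, STEM_FINALS, "smallest"))
```

Dropping the output component of every arc gives back an ordinary acceptor for the input language.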


Transducers and composition 


= An FST encodes a relation between languages
= A relation may contain an infinite number of ordered pairs, e.g. translating lower case letters to upper case
  [Figure: a lower/upper case transducer with arcs a:A, b:B, c:C, ...]
  [Figure: a path through the lower/upper case transducer, for string xyzzy]
= The application of a transducer to a string may also be viewed as composition of the FST with the identity relation on the string
  [Figure: composition of the identity transducer for a string with an FST]

Off-the-shelf finite-state tools

= Xerox finite-state tools
— http://www.xrce.xerox.com/competencies/content-analysis/fst/
  > Xerox Finite State Compiler (Demo)
— XFST Tools (provided with Beesley and Karttunen: Finite-State Morphology, CSLI Publications)
= Geertjan van Noord's finite-state tools
— http://odur.let.rug.nl/~vannoord/Fsa/
= FSA Utilities at Johns Hopkins
— http://cs.jhu.edu/~jason/406/software.html
= AT&T FSM Library
— http://www.research.att.com/sw/tools/fsm/


[Screenshot: Xerox Finite-State Compiler demo page (Xerox Research Centre Europe, Content Analysis). The page creates a finite-state network from a regular expression and applies the resulting network to strings.]

Regular expression:

(a|b)*c b+ (a|c) d

Network:

484 bytes. 5 states, 9 arcs, Circular.
Sigma: a b c d

s0: a -> s0, b -> s0, c -> s1.
s1: b -> s2.
s2: a -> s3, b -> s2, c -> s3, d -> fs4.
s3: d -> fs4.
fs4: (no arcs)


Exercises

= Write a program for acceptance of a string by a DFSA.
  Then extend it to a finite-state transducer that can translate a surface form to lemma + POS, or between upper and lower case.
= Determinize the following NFSA by subset construction.
  A = <{p,q,r,s}, {a,b}, δ, p, {s}> where δ is as follows: [transition table]
= Construct an NFSA with ε-transitions from the regular expression (a|b)ca*, according to the construction principles for union, concatenation and Kleene star.
  Then transform the NFSA to a DFSA by subset construction.
= Find a minimal DFSA for the FSA A = <{A,..,E}, {0,1}, δ, A, {C,E}> (using the table filling algorithm by propagation).


